{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inferring gene regulatory networks (GRNs) with SCENIC\n",
    "\n",
    "### Intro to SCENIC \n",
    "\n",
    "SCENIC is a tool to infer Gene Regulatory Networks and their associated cell states from **single-cell RNA-seq** data. \n",
    "\n",
    "A typical SCENIC workflow includes the following steps:\n",
    "\n",
    "0. Setting up your dataset\n",
    "\n",
    "Building the **gene regulatory network (GRN)**:\n",
    "\n",
    "1. Identify potential targets for each TF based on co-expression (\"co-expression modules\").\n",
    "\n",
    "    - Tool: GENIE3 or GRNBoost. \n",
    "\n",
    "\n",
    "2. Select potential direct-binding targets (\"regulons\") based on DNA-motif analysis \n",
    "    \n",
    "    - Tool: RcisTarget\n",
    "\n",
    "Identify **cell states** and their **regulators**:\n",
    "\n",
    "3. Analyzing the network activity in each individual cell\n",
    "\n",
    "    - Tool: AUCell\n",
    "    - Scoring regulons in the cells (calculate AUC)\n",
    "    - Optional: Convert the network activity into ON/OFF (binary activity matrix)\n",
    "\n",
    "4. Identify stable cell states based on their gene regulatory network activity (cell clustering) \n",
    "and exploring the results...\n",
    "\n",
    "   ![SCENIC.png](SCENIC.png)\n",
    "\n",
    "For more details on how and why SCENIC folows these steps, and some usage examples, you can check the original [paper](https://www.nature.com/articles/nmeth.4463).  \n",
    "\n",
    "Note that by default SCENIC is only available for three speciess: human, mouse and fly. For other species (or personalized motif collections) you would need to [create your own databases](https://github.com/aertslab/create_cisTarget_databases)). \n",
    "\n",
    "\n",
    "### Implementation: \n",
    "\n",
    "There are currently implementations of [SCENIC](https://scenic.aertslab.org) in [R](https://github.com/aertslab/SCENIC/) and in [Python](https://github.com/aertslab/pySCENIC). For running it in batch  on multiple datasets, or on big datasets, we normally recommend to use **[VSN](https://github.com/vib-singlecell-nf/vsn-pipelines)** (a *Nextflow workflow*), which highly automates and simplifies the analysis, as explained in the [SCENIC protocol paper](https://doi.org/10.1038/s41596-020-0336-2) and [examples](https://github.com/aertslab/SCENICprotocol/). \n",
    "The output from VSN can be explored from either R or Python, and the web browser (SCope).\n",
    " \n",
    "In this workshop, we will run SCENIC using **VSN** (Notebook 1), and explore the output in **SCope and R** (Notebook 2):\n",
    "\n",
    "- Notebook 1: Running SCENIC (VSN: includes Steps 1-4)\n",
    "\n",
    "- Notebook 2: Exploring SCENIC's output (R & SCope)\n",
    "\n",
    "For advanced users who might want more details, modify the algorithm, or run only parts of it, detailed tutorials explaining each step in deatail are avilable in R (e.g. \n",
    "[step by step](http://htmlpreview.github.io/?https://github.com/aertslab/SCENIC/blob/master/inst/doc/SCENIC_Running.html) and explanation of source code for [regulons](https://github.com/aertslab/SCENIC/blob/master/vignettes/detailedStep_2_createRegulons.Rmd)).\n",
    "\n",
    "\n",
    "----\n",
    "\n",
    "*Where to find help:*\n",
    "\n",
    "When you are doing your own analysis, you will likely bump into problems. In that case: \n",
    "\n",
    "* **Read the error message**: most of the time they are self-explanatory\n",
    "* Check the **help page of the function** you're using (e.g. `?get_regulons`)\n",
    "* Google that error message: if it's not clear to you what the problem is, just Google it; scRNA-seq community is very active (e.g. on BioStars or on stackoverflow).\n",
    "* Find and follow the package [tutorials](https://github.com/aertslab/SCENIC) (in R they are called \"vignettes\") and the links we have left along this notebook.  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Notebook 1: Running SCENIC\n",
    "\n",
    "The **input** to SCENIC is the single-cell RNA-seq expression matrix (i.e. the value of each field in the matrix corressponds to the expression of a gene in a cell):\n",
    "\n",
    "- The gene/feature ID should be the gene-symbol (for compatibility with cisTarget annotation databases).\n",
    "\n",
    "- Expression units: The preferred expression values are gene-summarized counts. There is currently not a strong consensus towards using the raw counts (e.g. from CellRanger), or counts normalized through single-cell specific methods (e.g. Seurat). \n",
    "Other measurements, such as transcripts/counts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis (first step of SCENIC) because they may induce artificial co-variation (Crow et al. (2016)). The choice of input expression matrix might have some effect on the co-expression analysis to create the regulons (step 1). The other steps of the workflow are not directly affected by the input expression values: (2) The expression is not taken into account for the motif analysis, and (3) AUCell, which is used for scoring the regulons on the cells, is cell ranking-based (it works as an implicit normalization). Overall, SCENIC is quite robust to this choice, we have applied SCENIC to datasets using raw (logged) UMI counts, normalized UMI counts, and TPM and they all provided reliable results (see Aibar et al. (2017)).\n",
    "    \n",
    "### Running SCENIC through VSN\n",
    "\n",
    "This part of the workshop is based on the\n",
    "[VSN pipeline tutorial](https://vsn-pipelines-examples.readthedocs.io/en/latest/PBMC10k.html).\n",
    "\n",
    "For more complex examples, i.e. explaing gene filtering, see the SCENIC [protocol example](https://htmlpreview.github.io/?https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/PBMC10k_SCENIC-protocol-CLI.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Before starting..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Check system requirements: VSN requires `Nextflow` and a container system (`Singularity` or `Docker`). VSN is run directly on a terminal.\n",
    "\n",
    "> Make sure to have matching versions. e.g. at the time of preparing this notebook: `VSN 0.24.0 -> Nextflow 20.04.1`\n",
    "and `VSN 0.25.0 -> Nextflow 20.10.1+`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "module load Nextflow/20.10.0 # only needed in some systems\n",
    "nextflow -version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "singularity version "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also make sure that LANG and LC_ALL environment variables have been set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "locale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If some are not set, you can set them to the default language for instance:\n",
    "export LANG=\"C\"\n",
    "export LC_ALL=\"C\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Set your working directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ln -s /anywhere/Projects/Tutorials/ ~\n",
    "cd ~/Tutorials/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p SCENIC_pbmc\n",
    "cd SCENIC_pbmc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Prepare dataset\n",
    "\n",
    "This tutorial analyzes a typical scRNA-seq dataset with a single run: **Blood cells (PBMC)** data available from the 10x Genomics support website. \n",
    "\n",
    "- The “Feature / cell matrix (filtered)” data can be downloaded from 10x Genomics website (It is the typical output from their pipeline, after running Cell Ranger)\n",
    "\n",
    "- This dataset includes **10k cells**. It would take approximately **2 hours** to run SCENIC on it with a HPC system using 15 processes. Therefore, in this tutorial we will not *run* the analysis. Instead, we will prepare the inputs, and in the second part of the tutorial, we will directly explore the .loom file that was produced as output from a previous run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_filtered_feature_bc_matrix.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When using 10x data as an input, the pipeline assumes the files are in the typical Cell Ranger directory structure (`<datasetName>/outs/`). Therefore, we will extract the files into a folder following that naming scheme:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create directory structure:\n",
    "mkdir -p pbmc10k_data/outs/\n",
    "# Extract the file:\n",
    "tar xvf pbmc_10k_v3_filtered_feature_bc_matrix.tar.gz -C pbmc10k_data/outs/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# See the resulting files:\n",
    "tree pbmc10k_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following step, when setting up the nextflow config file, the `tenx` input channel should point to the outs folder: `pwd + 'pbmc10k_data/outs'`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Setup the `VSN-pipelines` project\n",
    "\n",
    "First, pull the VSN repository in Nextflow. The `-r` flag can be used to specify the pipeline version to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nextflow pull vib-singlecell-nf/vsn-pipelines -r v0.25.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, define the run **settings** (a.k.a. create the `config` file):\n",
    "\n",
    "- `-profile`: Determines the parameters to include in the config file. Select the pipelines/settings that will be used in this specific analysis (i.e. this argument is to avoid creating huge file with *all* possibilities for *all* VSN-pipelines).\n",
    "\n",
    "In this case, we have used:\n",
    "\n",
    "- `tenx`: defines the input data type\n",
    "- `single_sample_scenic`: creates the basic parameters for the `single_sample` and `scenic` workflows (see https://vsn-pipelines.readthedocs.io/en/latest/pipelines.html for the list of available pipelines and their options)\n",
    "- `scenic_use_cistarget_motifs` and `scenic_use_cistarget_track`: includes parameters to specify the location of the cistarget database files (modify their location in the config file)\n",
    "- `hg38`: specifies the genome\n",
    "- `singularity` (or `docker`): specifies container system to use to run the processes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nextflow config vib-singlecell-nf/vsn-pipelines \\\n",
    "    -profile tenx,single_sample_scenic,scenic_use_cistarget_motifs,scenic_use_cistarget_tracks,hg38,singularity \\\n",
    "    > pbmc10k.vsn-pipelines.complete.config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will create the config file: `pbmc10k.vsn-pipelines.complete.config`\n",
    "\n",
    "Important variables to check/edit in the config file are:\n",
    "\n",
    "- `singularity.runOption` (or `docker.runOptions`): the correct volume mounts should be specified (requires the user home folder, and the location of the data).\n",
    "- `params.global.project_name` (optional): will control the naming of the output files.\n",
    "- `params.sc.scope.tree.level_${X}` (optional): controls the labeling of the loom file when uploaded to the SCope viewer.\n",
    "- `params.sc.scanpy.filter`\n",
    "- `params.sc.scanpy.feature_selection`\n",
    "- `params.sc.scanpy.clustering`\n",
    "- `compute_resources__`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Exercise:* \n",
    "Open the config file and edit the settings required to continue the tutorial, including: \n",
    "- the project name\n",
    "- `cellranger_mex = 'pbmc10k_data/outs/'`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Launch the run\n",
    "\n",
    "**First pass: Check filtering settings**\n",
    "\n",
    "While the overall goal is to run SCENIC, the VSN pipeline also includes preprocessing and filtering steps.\n",
    "Those filtering settings should be checked to confirm they are the appropriate for the given dataset.\n",
    "\n",
    "To save running time, it is possible and recommended to make a first pass running only those pre-processing steps  (determined by the argument `-entry single_sample`).\n",
    "i.e.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "nextflow -C pbmc10k.vsn-pipelines.complete.config \\\n",
    "    run vib-singlecell-nf/vsn-pipelines \\\n",
    "    -entry single_sample \\\n",
    "    -r v0.25.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> The argument `-entry` determines the pipeline that will be run. Note that it will ony be possible to run the pipelines added to the *config file* in the previous step. In this case 'single_sample_scenic' is equivalent to running 'single_sample' + SCENIC. See [VSN's readthedocs](\n",
    "https://vsn-pipelines.readthedocs.io/en/latest/pipelines.html) for the list and description of the different pipelines available."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting **QC reports** will be located in `out/notebooks/intermediate/pbmc10k.SC_QC_filtering_report.htm` (as ipynb, and converted html file). \n",
    "\n",
    "If needed, the cell and gene filters can be updated by editing the config file. \n",
    "\n",
    "For example, the filters used by default are:\n",
    "\n",
    "```\n",
    "params {\n",
    "    sc {\n",
    "        scanpy {\n",
    "            filter = {\n",
    "                cellFilterMinNGenes = 200\n",
    "                cellFilterMaxNGenes = 4000\n",
    "                cellFilterMaxPercentMito = 0.15\n",
    "                geneFilterMinNCells = 3\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "*Exercise:* Open the QC reports and have a look at the stats provided."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Second pass: Run SCENIC**\n",
    "\n",
    "Once the cell and gene filters look fine, we can re-start the pipeline to run SCENIC (setting `-entry single_sample_scenic`). \n",
    "\n",
    "Using  the `-resum` option will continue running the pipeline, skipping already completed steps: In this case, applying the correct filtering parameters, and continue to the upcoming SCENIC steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### Uncomment this field to run it. Note: it will probably take several hours.\n",
    "# nextflow -C pbmc10k.vsn-pipelines.complete.config \\\n",
    "#     run vib-singlecell-nf/vsn-pipelines \\\n",
    "#     -entry single_sample_scenic \\\n",
    "#      -r v0.25.0 \\\n",
    "#     -resume"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Results & output\n",
    "\n",
    "The main SCENIC outputs (including regulons and cell projections based on regulon activity) are packaged into a **loom file**. The loom file also includes the results of the parallel expression analysis. \n",
    "\n",
    "- The loom file can be found at `out/loom/pbmc10k.SCENIC_SCope_output.loom`, and is ready to be uploaded to a [SCope](http://scope.aertslab.org/) session. \n",
    "The loom file from this analysis can be found on the [SCENIC protocol SCope session](https://scope.aertslab.org/#/Protocol_Cases/) (DSL2 > pbmc10k.SCENIC_SCope_output.loom).\n",
    "\n",
    "Other relevant output files include:\n",
    "\n",
    "- `out/scenic/pbmc10k_data/notebooks/SCENIC_report.html`: Notebook with an overview of the SCENIC workflow. The last heatmap (\"AUC Heatmap - Top 5 regulons from each cell type\"), provides an useful overview of the regulons in the cells.\n",
    "\n",
    "- `out/scenic/`: The other folders contain partial results from SCENIC pipeline, which can be useful for advanced users. For example, the **motif enrichment analysis** and **co-expression modules** (output of GRNBoost/GENIE3).\n",
    "\n",
    "- `out/data/pbmc10k.PBMC10k_DSL2.single_sample.output.h5a`: an anndata file generated by the **Scanpy** section of the pipeline, including the results of the expression analysis in addition to SCENIC(i.e. clustering based on highly variable genes).\n",
    "\n",
    "- The `work` folder contains temporary files to allow resuming the pipeline if needed. It can be deleted once the pipeline is finished.\n",
    "\n",
    "In the second Notebook, we will explore these outputs in SCope and R."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the list of files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash 4.2.46 - r23i27n24",
   "language": "bash",
   "name": "rik_ssh_genius_r23i27n24_bash41"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
